Abstract

Cooperative games are those in which both 
agents share the same payoff structure. Value- 
based reinforcement-learning algorithms, such 
as variants of Q-learning, have been applied to 
learning cooperative games, but they only apply 
when the game state is completely observable to 
both agents. Policy search methods are a rea- 
sonable alternative to value-based methods for 
partially observable environments. In this paper, 
we provide a gradient-based distributed policy-search
method for cooperative games and compare
the notion of local optimum to that of Nash
equilibrium. We demonstrate the effectiveness of 
this method experimentally in a small, partially 
observable simulated soccer domain. 



1 INTRODUCTION 

The interaction of decision makers who share an environ- 
ment is traditionally studied in game theory and economics. 
The game theoretic formalism is very general, and analyzes 
the problem in terms of solution concepts such as Nash 
equilibrium [12], but usually works under the assumption
that the environment is perfectly known to the agents. 

In reinforcement learning, no explicit model of the
environment is assumed, and learning happens through trial 
and error. Recently, there has been interest in applying 
reinforcement learning algorithms to multi-agent environments.
For example, Littman describes and analyzes
a Q-learning-like algorithm for finding optimal policies
in the framework of zero-sum Markov games, in which
two players have strictly opposite interests. Hu and Wellman
propose a different multi-agent Q-learning algorithm
for general-sum games, and argue that it converges
to a Nash equilibrium.

A simpler, but still interesting, case is when multiple agents
share the same objectives. A study of the behavior of agents
employing Q-learning individually was made by Claus and
Boutilier, focusing on the influence of game structure
and exploration strategies on convergence to Nash equilibria.
In Boutilier's later work, an extension of value iteration
was developed that allows each agent to reason explicitly
about the state of coordination.

However, all of this research assumes that the agents have 
the ability to completely and reliably observe both the state 
of the environment and the reward received by the whole 
system. Schneider et al. [15] investigate a case of distributed
reinforcement learning, in which agents have complete
and reliable state observation, but only receive a local
reinforcement signal. They investigate rules that allow
individual agents to share reinforcement with their neighbors.
In this paper we investigate the complementary problem in 
which the agents all receive the shared reward signal, but 
have incomplete, unreliable, and generally different perceptions
of the world state. In such environments, value-search
methods are generally inappropriate, causing us to
turn to policy-search methods, which we have
applied previously to single-agent partially observable domains.

In this paper we describe a gradient-descent policy-search 
algorithm for cooperative multi-agent domains. In this set- 
ting, after each agent performs its action given its obser- 
vation according to some individual strategy, they all re- 
ceive the same payoff. Our objective is to find a learning 
algorithm that makes each agent independently find a strat- 
egy that enables the group of agents to receive the optimal 
payoff. Although this will not be possible in general, we 
present a distributed algorithm that finds local optima in 
the space of the agents' policies. 

The rest of the paper is organized as follows. In section 2,
we give a formal definition of a cooperative multi-agent
environment. In section 3, we review the gradient descent
algorithm for policy search, then develop it for the multi-agent
setting. In section 4, we discuss the different notions
of optimality for strategies. Finally, we present empirical
results in section 5.



2 IDENTICAL PAYOFF GAMES 

An identical payoff stochastic game (IPSG)^1 describes
the interaction of a set of agents with a Markov environment
in which they all receive the same payoffs. An IPSG
is a tuple $\langle S, \pi_0^S, G, T, r \rangle$, where $S$ is a discrete state space;
$\pi_0^S$ is a probability distribution^2 over the initial state; $G$
is a collection of agents, where an agent $i$ is a 3-tuple,
$\langle A^i, O^i, B^i \rangle$, of its discrete action space $A^i$, discrete observation
space $O^i$, and observation function $B^i : S \to \mathcal{P}(O^i)$;
$T : S \times A \to \mathcal{P}(S)$ is a mapping from states of the environment
and actions of the agents to probability distributions
over states of the environment;^3 and $r : S \times A \to \mathcal{R}$ is the
payoff function, where $A = \prod_i A^i$ is the joint action space
of the agents. When all agents in $G$ have the identity observation
function $B(s) = s$ for all $s \in S$, the game is completely
observable. Otherwise, it is a partially observable
IPSG (POIPSG).

In a POIPSG, at each time step: each agent $i \in (1..m)$ observes
$o_i(t)$ corresponding to $B^i(s(t))$ and selects an action
$a_i(t)$ according to its strategy; a compound action
$\vec{a}(t) = (a_1(t), \ldots, a_m(t))$ from the joint action space $A$ is
performed, inducing a state transition of the environment;
and the identical reward $r(t)$ is received by all agents.

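The objective of each agent is to choose a strategy that
maximizes the value of the game. For a discount factor
$\gamma \in [0, 1)$ and a set of strategies $\vec{\mu} = (\mu_1, \ldots, \mu_m)$, the
value of the game is

$$V(\vec{\mu}) = \sum_{t=1}^{\infty} \gamma^t \, E\big(r(t) \mid \vec{\mu}\big). \qquad (1)$$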



In the general case, a strategy for some agent is a mapping 
from the history of all observations from the beginning of 
the game into the current action $a(t)$. We limit our consideration
in this paper to cases in which the agent's actions
may depend only on the current observation, or in which 
the agent has a finite internal memory. When actions de- 
pend only on the current observation, the policy is called 
a memoryless or reactive policy. When this dependence is 
probabilistic, we call it a stochastic reactive policy, other- 
wise a deterministic reactive policy. 

Note that in a completely observable IPSG, reactive policies
are sufficient to implement the best possible joint strategy.
This follows directly from the fact that every MDP has
an optimal deterministic reactive policy. Therefore an
MDP with the product action space $A$ corresponding to
a completely observable IPSG also has one, representable


^1 IPSGs are also called stochastic games, Markov
games, and multi-agent Markov decision processes.

^2 Without loss of generality, we assume a degenerate distribution
with one initial state and omit it in our notation.

^3 $\mathcal{P}(\Omega)$ denotes the set of probability distributions defined on
some space $\Omega$.




Figure 1: An influence diagram for two agents with FSCs 
in a POIPSG. 



by deterministic reactive policies for each agent. However,
it has been shown that in partially observable environments,
the best reactive policy can be arbitrarily worse than the
best policy using memory. This statement can also be
easily extended to POIPSGs.

There are many possibilities for constructing policies with
memory [13, 10]. In this work we use a finite state controller
(FSC) for each agent. A more detailed description of
FSCs and derivation of algorithms for learning them may
be found in a previous paper [10]; we simply state the definition
here.



A finite state controller (FSC) for an agent with action
space $A$ and observation space $O$ is a tuple $\langle N, \pi_0^N, \eta, \psi \rangle$,
where $N$ is a finite set of internal controller states, $\pi_0^N$
is a probability distribution over the initial internal state,
$\eta : N \times O \to \mathcal{P}(N)$ is the internal state transition function
that maps an internal state and observation into a probability
distribution over internal states, and $\psi : N \to \mathcal{P}(A)$ is
the action function that maps an internal state into a probability
distribution over actions. Figure 1 depicts an influence
diagram for two agents controlled by FSCs.
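As an illustration of the definition above, the following is a minimal sketch of a tabular stochastic FSC. The class name `FiniteStateController`, the helper `_sample`, and the dictionary layout are assumptions made for this example and are not taken from the paper; the step order (observe, update the internal state, then act) follows the definition of $\eta$ and $\psi$.

```python
import random
from dataclasses import dataclass


def _sample(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist.keys())
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


@dataclass
class FiniteStateController:
    """Hypothetical tabular FSC (N, pi0, eta, psi)."""
    pi0: dict  # internal state -> initial probability
    eta: dict  # (internal state, observation) -> {next internal state: prob}
    psi: dict  # internal state -> {action: prob}

    def reset(self, rng):
        # Draw the initial internal state n(0).
        self.node = _sample(self.pi0, rng)

    def step(self, observation, rng):
        # Update the internal state from the observation, then act from it.
        self.node = _sample(self.eta[(self.node, observation)], rng)
        return _sample(self.psi[self.node], rng)


# A deterministic two-node controller that remembers the last observation.
fsc = FiniteStateController(
    pi0={"n0": 1.0},
    eta={("n0", "x"): {"n1": 1.0}, ("n0", "y"): {"n0": 1.0},
         ("n1", "x"): {"n1": 1.0}, ("n1", "y"): {"n0": 1.0}},
    psi={"n0": {"a": 1.0}, "n1": {"b": 1.0}},
)
rng = random.Random(0)
fsc.reset(rng)
actions = [fsc.step(o, rng) for o in ["x", "y", "x"]]
```

A reactive policy is the special case with a single internal state, so the same structure covers both policy classes discussed in this section.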

Note that in partially observable environments, agents con- 
trolled by FSCs might not have enough memory to even 
represent an optimal policy which could, in general, require 
infinite memory, as in a partially observable Markov decision
process (POMDP) [17]. In this paper, we concentrate
on the problem of finding the (locally) optimal controller 
from the class of FSCs with some fixed size of memory. 

To better understand IPSGs, let us consider an example
from Boutilier, illustrated in Figure 2. There are two
agents, $a_1$ and $a_2$, each of which has a choice of two actions,
$a$ and $b$, at any of three states. All transitions are
deterministic and are labeled by the joint action that corresponds
to the transition. For instance, the joint action $(a, b)$
corresponds to the first agent performing action $a$ and the
second agent performing action $b$. Here, $*$ refers to any







Figure 2: A coordination problem in a completely observ- 
able identical payoff game. 

action taken by the corresponding agent. 

The starting state is $s_1$, where the first agent alone decides
whether to move the environment to state $s_2$ by performing
action $a$ or to state $s_3$ by performing action $b$. In state $s_3$,
no matter what both agents do as the next step, they receive
a reward of +5 in state $s_6$ risk-free. In state $s_2$, the agents
have a choice of cooperating (choosing the same action,
whether $(a, a)$ or $(b, b)$), with reward +10 in state $s_4$, or
not (choosing different actions, whether $(a, b)$ or $(b, a)$),
and getting -10 in state $s_5$.

We will represent a joint policy with parameters $p^{\text{agent}}_{\text{state}}$,
denoting the probability that an agent will perform action
$a$ in the corresponding state. Only three parameters are
important for the outcome: $\{p^1_1, p^1_2; p^2_2\}$. The optimal joint
policies are $\{1, 1; 1\}$ or $\{1, 0; 0\}$, which are deterministic
reactive policies.
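To make the role of the three parameters concrete, here is a small sketch that evaluates the expected payoff of a joint policy in this game. The function name `coordination_value` is hypothetical, and the sketch ignores discounting (an assumption made only to keep the arithmetic readable).

```python
def coordination_value(p1_s1, p1_s2, p2_s2):
    """Expected undiscounted payoff of the joint policy {p1_s1, p1_s2; p2_s2}.

    Each argument is the probability that the named agent plays action a
    in the named state.
    """
    # Probability that both agents choose the same action in s2.
    p_match = p1_s2 * p2_s2 + (1 - p1_s2) * (1 - p2_s2)
    upper = 10 * p_match - 10 * (1 - p_match)  # s2 branch: +10 or -10
    lower = 5                                  # s3 branch: +5, risk-free
    return p1_s1 * upper + (1 - p1_s1) * lower
```

Under these assumptions, the two deterministic policies named in the text both evaluate to 10, while any policy in which the first agent always plays b in the starting state evaluates to 5.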

3 GRADIENT DESCENT FOR POLICY 
SEARCH 

In this section, we first introduce a general method for us- 
ing gradient descent in policy spaces, then show how it can 
be applied to multi-agent problems. 

3.1 Basic Algorithm 

Williams introduced the notion of policy search for reinforcement
learning in his REINFORCE algorithm [19, 20],
which was generalized to a broader class of error criteria
by Baird and Moore.

We will start by considering the case of a single agent interacting
with a POMDP. The agent's policy $\mu$ is assumed
to depend on some internal state taking on values in a finite
set $N$. We will not make any further commitment to details
of the policy's architecture, as long as it defines the
probability of action given past history as a continuous differentiable
function of some set of parameters $w$.

First, we will establish some notation. We denote by $H_t$ the
set of all possible experience sequences of length $t$:
$h = \langle n(0), o(1), n(1), a(1), r(1), \ldots, o(t), n(t), a(t), r(t), o(t+1) \rangle$.
In order to specify that some element is a part of the
history $h$ at time $\tau$, we write, for example, $r(\tau, h)$ and
$a(\tau, h)$ for the $\tau$-th reward and action in the history $h$.
We will also use $h^{\tau}$ to denote a prefix of the sequence
$h \in H_t$ truncated at time $\tau \le t$:
$h^{\tau} = \langle n(0), o(1), n(1), a(1), r(1), \ldots, o(\tau), n(\tau), a(\tau), r(\tau), o(\tau+1) \rangle$.
The value defined by equation (1) can be rewritten as

$$V(\mu) = \sum_{t=1}^{\infty} \gamma^t \sum_{h \in H_t} \Pr(h \mid \mu) \, r(t, h). \qquad (2)$$



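Let us assume the policy is expressed parametrically in
terms of a vector of weights $w = \{w_1, \ldots, w_M\}$. If we
could calculate the derivative of $V$ for each $w_k$, it would be
possible to do an exact gradient descent on value $V$ by making
updates $\Delta w_k = \alpha \frac{\partial}{\partial w_k} V$. We can compute the derivative
for each weight $w_k$,

$$\begin{aligned}
\frac{\partial V}{\partial w_k}
&= \sum_{t=1}^{\infty} \gamma^t \sum_{h \in H_t} r(t, h) \, \frac{\partial \Pr(h \mid \mu)}{\partial w_k} \\
&= \sum_{t=1}^{\infty} \gamma^t \sum_{h \in H_t} \Pr(h \mid \mu) \, r(t, h) \sum_{\tau=1}^{t} \frac{\partial \ln \Pr\big(n(\tau, h), a(\tau, h) \mid h^{\tau-1}, \mu\big)}{\partial w_k}.
\end{aligned}$$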

But, in the spirit of reinforcement learning, we cannot assume
the knowledge of a world model that would allow us
to calculate $\Pr(h \mid \mu)$, so we must retreat to stochastic gradient
descent instead. We sample from the distribution of
histories by interacting with the environment, and calculate
during each trial an estimate of the gradient, accumulating
the quantities

$$\gamma^t r(t, h) \sum_{\tau=1}^{t} \frac{\partial \ln \Pr\big(n(\tau, h), a(\tau, h) \mid h^{\tau-1}, \mu\big)}{\partial w_k} \qquad (3)$$

for all $t$. For a particular policy architecture, this can be
readily translated into a gradient descent algorithm that is
guaranteed to converge to a local optimum of $V$.
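The accumulation in expression (3) can be sketched for one concrete policy architecture. The example below assumes a memoryless (reactive) tabular Boltzmann policy with weights `w[obs][action]`; that architecture, the function names, and the history format are assumptions chosen for illustration, not the paper's implementation.

```python
import math


def softmax(weights):
    """Numerically stable Boltzmann distribution over a list of weights."""
    m = max(weights)
    exps = [math.exp(x - m) for x in weights]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(w, obs, action):
    """d ln Pr(action | obs) / dw for the tabular policy w[obs][a]."""
    probs = softmax(w[obs])
    grad = {o: [0.0] * len(w[o]) for o in w}
    for a, p in enumerate(probs):
        grad[obs][a] = (1.0 if a == action else 0.0) - p
    return grad


def trial_gradient(w, history, gamma):
    """Accumulate expression (3) over one trial.

    history is a list of (observation, action, reward) triples for
    t = 1..T; the result has the same shape as w.
    """
    total = {o: [0.0] * len(w[o]) for o in w}
    trace = {o: [0.0] * len(w[o]) for o in w}  # sum over tau <= t
    for t, (obs, action, reward) in enumerate(history, start=1):
        g = grad_log_policy(w, obs, action)
        for o in w:
            for a in range(len(w[o])):
                trace[o][a] += g[o][a]
                total[o][a] += (gamma ** t) * reward * trace[o][a]
    return total
```

A learner would then move each weight by the learning rate times the corresponding entry of `trial_gradient`, which is a stochastic estimate of the exact derivative.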

3.2 Central Control Of Factored Actions 

Now let us consider the case in which the action is factored,
meaning that each action $\vec{a}$ consists of several components
$\vec{a} = (a_1, \ldots, a_m)$. We can consider two kinds of controllers:
a joint controller is a policy mapping observations to
the complete joint distribution $\Pr(\vec{a})$; a factored controller
is made up of independent sub-policies $\mu_{a_i} : O^i \to \mathcal{P}(a_i)$
(possibly with a dependence on individual internal state)
for each action component.

Factored controllers can represent only a subset of the policies
represented by joint controllers. Obviously, any product
of policies for the factored controller $\prod_i \mu_{a_i}$ can be
represented by a joint controller $\mu_{\vec{a}}$, for which
$\Pr(\vec{a}) = \prod_{i=1}^{m} \Pr(a_i)$. However, there are some stochastic joint
controllers that cannot be represented by any factored controller,
because they require coordination of probabilistic
choice across action components, which we illustrate by
the following example.

The first action component controls the liquid component
of a meal, $a_1 \in \{\text{vodka}, \text{milk}\}$, and the second controls the
solid one, $a_2 \in \{\text{pickles}, \text{cereal}\}$. For the sake of argument,
let us assume that sticking to one combination or another
is not as good as a "mixed strategy", meaning that for a
healthy diet, we sometimes want to eat milk with cereal,
other times vodka with pickles. The optimal policy is randomized,
say 10% of the time $\vec{a} = (\text{vodka}, \text{pickles})$ and
90% of the time $\vec{a} = (\text{milk}, \text{cereal})$. But when the components
are controlled independently, we cannot represent
this policy. With randomization, we are forced to drink
vodka with cereal or milk with pickles on some occasions.
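The arithmetic behind the meal example can be checked directly. The sketch below takes the joint distribution from the text, computes its marginals, and rebuilds the closest factored controller as their product; the helper names `marginals` and `factored` are hypothetical.

```python
def marginals(joint):
    """Marginal distributions of the two action components."""
    liquid, solid = {}, {}
    for (a1, a2), p in joint.items():
        liquid[a1] = liquid.get(a1, 0.0) + p
        solid[a2] = solid.get(a2, 0.0) + p
    return liquid, solid


def factored(liquid, solid):
    """Joint distribution induced by two independent sub-policies."""
    return {(a1, a2): p1 * p2
            for a1, p1 in liquid.items()
            for a2, p2 in solid.items()}


joint = {("vodka", "pickles"): 0.1, ("milk", "cereal"): 0.9}
liquid, solid = marginals(joint)
product = factored(liquid, solid)
```

With these numbers the product distribution puts probability 0.09 on each of the two unwanted meals, so no factored controller with the same marginals reproduces the coordinated joint policy.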

Because we are interested in individual agents learning in- 
dependent policies, we concentrate on learning the best fac- 
tored controller for a domain, even if it is suboptimal in a 
global sense. Requiring a controller to be factored simply 
puts constraints on the class of policies, and therefore distributions
$P(\vec{a} \mid \mu, h)$, that can be represented. The stochastic
gradient-descent techniques of the previous section can
still be applied directly in this case to find local optima in
the controller space. We will call this method joint gradient 
descent. 



3.3 Distributed Control Of Factored Actions 

The next step is to learn to choose action components
not centrally, but under the distributed control of multiple
agents. One obvious strategy would be to have each agent
perform the same gradient-descent algorithm in parallel to
adapt the parameters of its own local policy $\mu_{a_i}$. Perhaps
surprisingly, this distributed gradient descent (DGD)
method is very effective.

Theorem 1 For factored controllers, distributed gradient 
descent is equivalent to joint gradient descent. 

Proof: We will show that for both controllers the algorithm
will be stepwise the same, so starting from the same point
in the search space, on the same data sequence, the algorithms
will converge to the same locally optimal parameter
setting. For simplicity we assume a degenerate set of internal
controller states $N$. Then, for a factored controller, a history $h$ can be
described as $\langle o_1(1), \ldots, o_m(1), a_1(1), \ldots, a_m(1), r(1), \ldots \rangle$.
The corresponding history for an individual agent $i$ is
$h_i = \langle o_i(1), a_i(1), r(1), \ldots \rangle$. It is clear that a collection $h_1 \ldots h_m$
of individual histories, one for each agent, specifies the
joint history $h$.



The joint gradient descent algorithm requires that we draw
sample histories from $\Pr(h \mid \vec{\mu})$ and that we do gradient
descent on $w$ with a sample of the gradient at each time
step $t$ in the history equal to

$$\gamma^t r(t, h) \sum_{\tau=1}^{t} \frac{\partial \ln \Pr\big(\vec{a}(\tau, h) \mid h^{\tau-1}, \vec{\mu}\big)}{\partial w}.$$

Whether a factored controller is being executed by a single
agent, or it is implemented by agents individually executing
policies $\mu_{a_i}$ in parallel, joint histories are generated
from the same distribution $\Pr(h \mid \langle \mu_{a_1}, \ldots, \mu_{a_m} \rangle)$. So the
distributed algorithm is sampling from the correct distribution.

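Now, we must show that the weight updates are the same
in the distributed algorithm as in the joint one. Let
$w_k = (w_k^1, \ldots, w_k^{M_k})$ be the set of parameters controlling action
component $a_k$. Then

$$\frac{\partial}{\partial w_k^j} \ln \Pr\big(a_l(\tau) \mid h_l^{\tau-1}, \mu_l\big) = 0 \quad \text{for all } k \ne l,$$

that is, the action probabilities of agent $l$ are independent of
the parameters in other agents' policies. With this in mind,
for factored controllers, the derivative in the expression above becomes

$$\begin{aligned}
\frac{\partial}{\partial w_k^j} \ln \Pr\big(\vec{a}(\tau, h) \mid h^{\tau-1}, \vec{\mu}\big)
&= \sum_{l=1}^{m} \frac{\partial}{\partial w_k^j} \ln \Pr\big(a_l(\tau, h) \mid h_l^{\tau-1}, \mu_l\big) \\
&= \frac{\partial}{\partial w_k^j} \ln \Pr\big(a_k(\tau, h) \mid h_k^{\tau-1}, \mu_k\big).
\end{aligned}$$

Thus, the same weight updates will be performed by DGD
as by joint gradient descent on a factored controller. ■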
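The key step of the proof, that the joint log-probability has the same derivative with respect to one agent's weights as that agent's own log-probability, can be checked numerically. The sketch below assumes two agents with two-action Boltzmann policies and uses central finite differences; the weights, function names, and chosen actions are arbitrary illustrative values.

```python
import math


def log_prob(weights, action):
    """ln Pr(action) under a Boltzmann policy with the given weights."""
    m = max(weights)
    log_z = m + math.log(sum(math.exp(x - m) for x in weights))
    return weights[action] - log_z


def joint_log_prob(w1, w2, a1, a2):
    # Factored controller: the joint action probability is a product,
    # so its logarithm is a sum of per-agent terms.
    return log_prob(w1, a1) + log_prob(w2, a2)


def finite_diff(f, w, j, eps=1e-6):
    """Central-difference estimate of df/dw[j]."""
    up = list(w)
    up[j] += eps
    down = list(w)
    down[j] -= eps
    return (f(up) - f(down)) / (2 * eps)


w1, w2 = [0.3, -1.2], [2.0, 0.5]
a1, a2 = 0, 1
joint_grad = [finite_diff(lambda w: joint_log_prob(w, w2, a1, a2), w1, j)
              for j in range(2)]
local_grad = [finite_diff(lambda w: log_prob(w, a1), w1, j)
              for j in range(2)]
```

The two gradient vectors agree up to finite-difference error, which is why each agent can compute its update from its own action probabilities alone.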

This theorem shows that policy learning and control over 
component actions can be distributed among independent 
agents who are not aware of each other's choice of actions.
An important requirement, though, is that agents perform 
simultaneous learning (which might be naturally synchro- 
nized by the coming of the rewards). 

4 RELATING LOCAL OPTIMA TO NASH 
EQUILIBRIA 

In game theory, the Nash equilibrium is a common solution 
concept. Because gradient descent methods can often be 
guaranteed to converge to local optima in the policy space, 
it is useful to understand how those points are related to 



Nash equilibria. We will limit our discussion to the two-agent
case, but the results are generalizable to more agents.

A Nash equilibrium is a pair of strategies such that deviation
by one agent from its strategy, assuming the other
agent's strategy is fixed, cannot improve the overall performance.
Formally, in an IPSG, a Nash equilibrium point is a
pair of strategies $(\mu_1^*, \mu_2^*)$ such that:

$$V((\mu_1^*, \mu_2^*)) \ge V((\mu_1, \mu_2^*)),$$
$$V((\mu_1^*, \mu_2^*)) \ge V((\mu_1^*, \mu_2))$$

for all $(\mu_1, \mu_2)$. When the inequalities are strict, it is called
a strict Nash equilibrium.

Every discounted stochastic game has at least one Nash
equilibrium point. It has been shown that under certain
convexity assumptions about the shape of payoff functions,
the gradient-descent process converges to an equilibrium
point. It is clear that the optimal Nash equilibrium
point (the Nash equilibrium with the highest value) in an
IPSG also is a possible point of convergence for the gradient
descent algorithm, since it is the global optimum in the
policy space.

Let us return to the game described in Figure 2. It has two
optimal strict Nash equilibria at $\{1, 1; 1\}$ and $\{1, 0; 0\}$. It
also has a set of sub-optimal Nash equilibria $\{0, p^1_2; p^2_2\}$,
where $p^2_2$ can take on any value in the interval [.25, .75] and
$p^1_2$ can take any value in the interval [0, 1]. The sub-optimal
Nash equilibria represent situations in which the first agent 
always chooses the bottom branch and the second agent 
acts moderately randomly in state s2. In such cases, it is 
strictly better for the first agent to stay on the bottom branch 
with expected value +5. For the second agent, the payoff 
is +5 no matter how it behaves, so it has no incentive to 
commit to a particular action in state s2 (which is necessary 
for the upper branch to be preferred). 
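The interval [.25, .75] can be verified with a small check of unilateral deviations. The sketch below ignores discounting and tests deviations on a coarse grid, which suffices here because the value is linear in each parameter; the names `value` and `is_nash` are hypothetical.

```python
from itertools import product


def value(p1_s1, p1_s2, p2_s2):
    """Expected undiscounted payoff of the joint policy {p1_s1, p1_s2; p2_s2}."""
    p_match = p1_s2 * p2_s2 + (1 - p1_s2) * (1 - p2_s2)
    return p1_s1 * (20 * p_match - 10) + (1 - p1_s1) * 5


def is_nash(p1_s1, p1_s2, p2_s2, tol=1e-9):
    """True if neither agent can raise the value by deviating alone."""
    base = value(p1_s1, p1_s2, p2_s2)
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    # Agent 1 may change both of its parameters; agent 2 stays fixed.
    best1 = max(value(x, y, p2_s2) for x, y in product(grid, grid))
    # Agent 2 may change its single parameter; agent 1 stays fixed.
    best2 = max(value(p1_s1, p1_s2, z) for z in grid)
    return best1 <= base + tol and best2 <= base + tol
```

For example, when the second agent plays a with probability 0.9 in $s_2$, the first agent gains by moving to the upper branch and matching, so that point is not an equilibrium.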

In this problem, the Nash equilibria are also all local op- 
tima for the gradient descent algorithm. Unfortunately, this 
equivalence only holds in one direction in the general case. 
We state this more precisely in the following theorems. 

Theorem 2 Every strict Nash equilibrium is a local opti- 
mum for gradient descent in the space of parameters of a 
factored controller.

Proof: Assume that we have two agents and denote the
strategy at the point of strict Nash equilibrium as $(\mu_1^*, \mu_2^*)$,
encoded by parameter vector $\langle w_1^1 \ldots w_1^{M_1}, w_2^1 \ldots w_2^{M_2} \rangle$. For
simplicity, let us further assume that $(\mu_1^*, \mu_2^*)$ is not on the
boundary of the parameter space, and each weight is locally
relevant: that is, that if the weight changes, the policy
changes, too.

By the definition of Nash equilibrium, any change in value
of the parameters of one agent without change in the other
agent's parameters results in a decrease in the value $V$. In
other words, we have that $\partial V / \partial w_i^j \le 0$ and $-\partial V / \partial w_i^j \le 0$
for all $j$ and $i$ at the equilibrium point. Thus, $\partial V / \partial w_i^j = 0$
for all $w_i^j$ at $(\mu_1^*, \mu_2^*)$, which implies it is a singular point
of $V$. Furthermore, because the value decreases in every
direction, it must be a maximum.

In the case of a locally irrelevant parameter $w_i^j$, $V$ will have
a ridge along its direction. All points on the ridge are singular
and, although they are not strict local optima, they are
essentially local optima for gradient descent. ■

The problem of Nash equilibria on the boundary of the pa- 
rameter space is an interesting one. Whether or not they are 
convergence points depends on the details of the method 
used to keep gradient descent within the boundary. A particular
problem comes up when the equilibrium point occurs
when one or more parameters have infinite value (this
is not uncommon, as we shall see in section 5). In such
cases, the equilibrium cannot be reached, but it can usually
be approached closely enough for practical purposes.

Theorem 3 Some local optima for gradient descent in the 
space of parameters of a factored controller are not Nash 
equilibria. 

Proof: Consider a situation in which each agent's policy
has a single parameter $w_i$, so the policy space can be described
by $\langle w_1, w_2 \rangle$. We can construct a value function
$V(w_1, w_2)$ such that for some $c$, $V(\cdot, c)$ has two modes,
one at $V(a, c)$ and the other at $V(b, c)$, such that $V(b, c) > V(a, c)$.
Further assume that $V(a, \cdot)$ and $V(b, \cdot)$ each have
global maxima $V(a, c)$ and $V(b, c)$. Then $V(a, c)$ is a local
optimum that is not a Nash equilibrium. ■
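One way to see the construction at work is to pick a concrete value function with the required shape. The function below (two Gaussian bumps in $w_1$ and a quadratic penalty in $w_2$) and the constants `A`, `B`, `C` are hypothetical choices made for this sketch; gradient ascent by finite differences stands in for the learning algorithm.

```python
import math

A, B, C = -2.0, 2.0, 0.0  # hypothetical mode locations a, b and the value c


def V(w1, w2):
    """Bimodal in w1 (lower mode near A, higher near B), peaked at w2 = C."""
    bumps = math.exp(-(w1 - A) ** 2) + 2.0 * math.exp(-(w1 - B) ** 2)
    return bumps - (w2 - C) ** 2


def gradient_ascent(w1, w2, rate=0.1, steps=2000, eps=1e-6):
    """Follow a finite-difference estimate of the gradient of V."""
    for _ in range(steps):
        g1 = (V(w1 + eps, w2) - V(w1 - eps, w2)) / (2 * eps)
        g2 = (V(w1, w2 + eps) - V(w1, w2 - eps)) / (2 * eps)
        w1, w2 = w1 + rate * g1, w2 + rate * g2
    return w1, w2


# Start in the basin of the lower mode.
w1, w2 = gradient_ascent(-1.5, 0.5)
```

The search settles near $(a, c)$, yet the first agent alone could raise the value by switching its parameter to $b$, so the point reached is a local optimum without being a Nash equilibrium.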

5 EXPERIMENTS 

There are no established benchmark problems for multi-agent
learning. To illustrate our method we present empirical
results for two problems: the simple coordination
problem of Figure 2 and a small multi-agent soccer domain.

5.1 Simple Coordination Problem 

We originally discussed policies for this problem using
three weights, one corresponding to each of the $p^{\text{agent}}_{\text{state}}$
probabilities. However, to force gradient descent to respect
the bounds of the simplex, we used the standard Boltzmann
encoding, so that for agent $i$ in state $s$ there are two weights,
$w^i_{s,a}$ and $w^i_{s,b}$, one for each action. The action probability
is coded as a function of these weights as

$$p^i_s = \frac{e^{w^i_{s,a}}}{e^{w^i_{s,a}} + e^{w^i_{s,b}}}.$$
We ran DGD with a learning rate of $\alpha = .003$ and a discount
factor of $\gamma = .99$; the results are shown in Figure 3.

Figure 3: Average payoff and payoff of a sample run of
distributed gradient descent on the simple problem.

The graph of a sample run illustrates how the agents typically
initially move towards a sub-optimal policy. The policy
in which the first agent always takes action $b$ and the second
agent acts fairly randomly is a Nash equilibrium, as
we saw in section 4. However, this policy is not exactly
representable in the Boltzmann parameterization because it
requires one of the weights to be infinite to drive a probability
to either 0 or 1.

So, although the algorithm moves toward this policy, it 
never reaches it exactly. This means that there is an advantage
for the second agent to drive its parameter toward
0 or 1, resulting in eventual convergence toward a global
optimum (note that, in this parameterization, these optima 
cannot be reached exactly, either). The average of 10 runs 
shows that the algorithm always converges to a pair of poli- 
cies with value very close to the maximum value of 10. 
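The experiment can be approximated with a short simulation. This is a sketch under several assumptions that are not stated in the paper: both agents observe the true state, actions taken in $s_3$ are skipped because they do not affect the payoff, the payoff is discounted by $\gamma$ raised to the number of decision steps, and all names (`action_probs`, `play_episode`, `dgd_update`, `run_dgd`) are hypothetical.

```python
import math
import random


def action_probs(w):
    """Boltzmann probabilities for the two weights [w_a, w_b]."""
    m = max(w)
    ea, eb = math.exp(w[0] - m), math.exp(w[1] - m)
    return [ea / (ea + eb), eb / (ea + eb)]


def play_episode(weights, rng):
    """Return the visited (state, joint action) pairs and the final payoff."""
    def act(state):
        return tuple(
            0 if rng.random() < action_probs(weights[i][state])[0] else 1
            for i in range(2))
    first = act("s1")
    trajectory = [("s1", first)]
    if first[0] == 1:  # agent 1 chose b: lower branch, payoff +5
        return trajectory, 5.0
    second = act("s2")
    trajectory.append(("s2", second))
    return trajectory, (10.0 if second[0] == second[1] else -10.0)


def dgd_update(weights, trajectory, payoff, rate, discount):
    """Each agent adjusts only its own weights, using the shared payoff."""
    for i in range(2):
        # Compute all gradients before changing any weight of this agent.
        grads = []
        for state, joint in trajectory:
            p = action_probs(weights[i][state])
            grads.append((state, [(1.0 if a == joint[i] else 0.0) - p[a]
                                  for a in range(2)]))
        for state, g in grads:
            for a in range(2):
                weights[i][state][a] += rate * discount * payoff * g[a]


def run_dgd(episodes=5000, rate=0.003, gamma=0.99, seed=0):
    rng = random.Random(seed)
    weights = [{"s1": [0.0, 0.0], "s2": [0.0, 0.0]} for _ in range(2)]
    for _ in range(episodes):
        trajectory, payoff = play_episode(weights, rng)
        dgd_update(weights, trajectory, payoff, rate,
                   gamma ** len(trajectory))
    return weights
```

Calling `run_dgd()` returns the learned weights; passing them through `action_probs` gives the probabilities of action a, which can never reach 0 or 1 exactly for finite weights.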

5.2 Soccer 

We have also conducted experiments on a small soccer domain
adapted from Littman. The game is played on
a 6 × 5 grid as shown in Figure 4. There are two learning
agents on one team and a single opponent with a fixed strategy
on the other. Every time the game begins, the learning
agents are randomly placed in the right half of the field,
and the opponent in the left half of the field. Each cell in
the grid contains at most one player. Every player on the
field (including the opponent) has an equal chance of initially
possessing the ball.

At each time step, a player can execute one of the six actions:
{North, South, East, West, Stay, Pass}. When
an agent passes, the ball is transferred to the other agent on
its team on the next time step. Once all players have selected
actions, they are executed in a random order. When
a player executes an action that would move it into the cell
occupied by some other player, possession of the ball goes
to the stationary player and the move does not occur. When
the ball falls into one of the goals, the game ends and a reward
of ±1 is given.

Figure 4: The soccer field. V1 and V2 represent learning
agents and O represents the opponent.

We made a partially observable version of the domain to 
test the effectiveness of DGD: each agent can only obtain 
information about which player possesses the ball and the 
status of cells to the north, south, east and west of its lo- 
cation. There are 3 possible observations for each cell: 
whether it is open, out of the field, or occupied by someone.
In Figure 5, we compare the learning curves of DGD
to those of Q-learning with a central controller for both the 
completely observable and the partially observable cases. 
We also show learning curves of DGD without the action 
Pass in order to measure the cooperativeness of the learned 
policies. 
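The size of the local observation space follows from this description. The sketch below assumes three players on the field (the two learners and one opponent), so that ball possession takes three values; the encoding order and the names `CELL_STATUS` and `encode_observation` are arbitrary choices made for illustration.

```python
CELL_STATUS = ("open", "out_of_field", "occupied")


def encode_observation(ball_holder, north, south, east, west, n_players=3):
    """Map one local observation to an integer in range(n_players * 3**4)."""
    if not 0 <= ball_holder < n_players:
        raise ValueError("ball_holder out of range")
    index = ball_holder
    for cell in (north, south, east, west):
        index = index * len(CELL_STATUS) + CELL_STATUS.index(cell)
    return index


# 3 ball holders times 3 statuses for each of the 4 neighboring cells.
N_OBSERVATIONS = 3 * len(CELL_STATUS) ** 4
```

Under this assumption a reactive policy needs one row of action weights per observation index, far fewer than the number of full field configurations.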

The graphs in the figure summarize simulations of the game 
against three different fixed-strategy opponents: 

• Random: Executes actions uniformly at random. 

• Greedy: Moves toward the player possessing the ball and 
stays there. Whenever it has the ball, it rushes to the goal. 

• Defensive: Rushes to the front of its own goal, and stays 
or moves at random, but never leaves the goal area. 

We show the average performance over 10 runs with error 
bars for standard deviation. The learning rate was 0.05 for 
DGD and 0.1 for Q-learning, and the discount factor was 
0.999, throughout the experiments. Each agent in the DGD 
team learned a reactive policy. The policy's parameters 
were initialized by drawing uniformly at random from the 
appropriate domains. We used ε-greedy exploration with
ε = 0.4 for Q-learning. The performance in the graph is
reported by evaluating the greedy policy derived from the 
Q-table. 
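For reference, ε-greedy action selection can be sketched as follows. This is the textbook rule written as a hypothetical helper, not the code used in the experiments; evaluation of the greedy policy corresponds to calling it with `epsilon=0.0`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return q_values.index(best)  # ties go to the lowest action index


rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], 0.0, rng)
explore_action = epsilon_greedy([0.1, 0.7, 0.3], 0.4, rng)
```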

Because, in the completely observable case, this domain is
an MDP (the opponent's strategy is fixed, so it is not really
an adversarial game), Q-learning can be expected to learn
the optimal joint policy, which it seems to do. It is interesting
to note the slow convergence of completely observable
Q-learning against the random opponent. We conjecture
that this is because, against a random opponent, a much
larger part of the state space is visited. The table-based
value function offers no opportunity for generalization, so
it requires a great deal of experience to converge.

Figure 5: Learning curves of DGD (policy search) and Q-learning against a defensive opponent (top left), a greedy opponent
(top right), a random opponent (bottom left), and a team with two agents with different fixed strategies (bottom right).

As soon as observability is restricted, Q-learning no longer 
reliably converges to the best strategy. The joint Q-learner 
now has as its input the two local observations of the indi- 
vidual players. It behaves quite erratically, with extremely 
high variance because it sometimes converges to a good 
policy and sometimes to a bad one. This unreliable be- 
havior can be attributed to the well-known problems of us- 
ing value-function approaches, and especially Q-learning, 
in POMDPs. 

The individual DGD agents have stochasticity in their action 
choices, which gives them some representational abilities 
unavailable to the Q-learner. We tried additional experiments
in which the DGD agents had 4-state FSCs. Their
performance did not improve appreciably. We expect that, 
in future experiments on a larger field with more players, it 
will be important for the agents to have internal state. 

Despite the fact that they learn independently, the combi- 
nation of policy search plus a different policy class allows 



them to gain considerably improved performance. We can- 
not tell how close this performance is to the optimal per- 
formance with partial observability, because it would be 
computationally impractical to solve the POIPSG exactly. 
Bernstein et al. show that in the finite-horizon case
two-agent POIPSGs are harder to solve than POMDPs (in
the worst-case complexity sense). 

It is also important to see that the two DGD agents have 
learned to cooperate in some sense: when the same algo- 
rithm is run in a domain without the "pass" action, which 
allows one agent to give the ball to its teammate, perfor- 
mance deteriorates significantly against both defensive and 
greedy opponents. Since the agents don't know where they 
are with respect to the goal, they probably choose to pass 
whenever they are faced with the opponent. Against a com- 
pletely random opponent, both strategies do equally well. 
It is probably sufficient, in this case, to simply run straight 
for the goal, so cooperation is not necessary. 

We performed some additional experiments in a two-on- 
two domain in which one opponent behaved greedily and 
the other defensively. In this domain, the completely ob- 
servable state space is so large that it is difficult to even 
store the Q table, let alone populate it with reasonable values.
Thus, we just compare two 4-state DGD agents with
a limited-view centrally controlled Q-learning algorithm.
Not surprisingly, we find that the DGD agents are considerably
more successful.

Finally, we performed informal experiments with an in- 
creasing number of opponents. The opponent team was 
made up of one defensive agent and an increasing number 
of greedy agents. For all cases in which the opponent team 
had more than two greedy agents, DGD led to a defensive
strategy in which, most of the time, the agents all rushed to 
the front of their goal and stayed there forever. 

6 CONCLUSIONS AND FUTURE WORK 

We have presented an algorithm for distributed learning in 
cooperative multi-agent domains. It is guaranteed to find 
local optima in the space of factored policies. We cannot 
show, however, that it always converges to a Nash equilib- 
rium, because there are local optima in policy space that 
are not Nash equilibria. The algorithm performed well in a 
small simulated soccer domain. 

It will be important to apply this algorithm in more com- 
plex domains, to see if the gradient remains strong enough 
to drive the search effectively, and to see whether local op- 
tima become problematic. An interesting extension of this 
work would be to allow the agents to perform explicit com- 
munication actions with one another to see if they are ex- 
ploited to improve performance in the domain. 

In addition, there may be more interesting connections to 
establish with game theory, especially in relation to solu- 
tion concepts other than Nash equilibrium, which may be 
more appropriate in cooperative games. 